Abstract: The article discusses the use of machine learning (ML) to combat phishing
websites, which are deceptive sites that mimic trusted entities to steal sensitive
information. Such attacks pose significant risks to the integrity of online security,
which is why the continued development of methods for identifying and counteracting
phishing threats is beneficial. To enhance the success rate and specificity of predicting
phishing websites, this study proposes a new approach that utilizes machine learning
algorithms. To achieve better classification results and better prediction of customer
behaviour, the study concentrates on two points: increasing classifier accuracy and
selecting an optimal feature space. Traditional anti-phishing strategies like
blacklisting and heuristic searches often have slow detection times and high false
positive rates. The article introduces a novel feature selection method to extract
highly correlated features from datasets, thereby enhancing classifier accuracy. Using
six feature selection techniques on a phishing dataset, it evaluates eight classifiers,
including SVM, Logistic Regression, Random Forest, and others. The study finds that the
Random Forest classifier combined with the Chi-2 feature selection method significantly
improves model accuracy, achieving up to 96.99%.

Introduction

Computer viruses and biological viruses share a fundamental similarity in their
intent to spread and replicate, although they operate in distinct realms. Computer
viruses are malicious software programs designed to infect and disrupt computer
systems by copying themselves and modifying other programs. Biological viruses, on
the other hand, invade living cells to reproduce and spread, causing harm to their
hosts (Franjić, 2020). In the context of cybersecurity, anti-phishing software has
been developed to combat phishing attacks (Srinivas et al., 2019), which are similar
to viruses in how they propagate. However, these programs often fail to detect all
types of phishing attacks, because such attacks frequently involve deceptive web
pages rather than executable programs and exploit vulnerabilities to steal sensitive
data. Phishing attacks typically begin with digital interactions such as emails or
social media messages. Like biological viruses that activate through interaction
with the host (Korkmaz et al., 2020), phishing attacks use these communications as a
medium to spread. The strategy involves deceiving victims into providing personal
information like credit card details and passwords (Oest et al., 2018). Over time,
phishing techniques evolve to avoid detection (Gupta et al., 2022), often by
mimicking legitimate websites and using credible URLs or email addresses to appear
trustworthy. This evolution is akin to how biological viruses mutate to evade the
immune response of their hosts. The severity of phishing attacks is underscored by a
report from SlashNext, which noted a 61% increase in phishing incidents since 2021,
identifying 255 million attacks in a six-month period across various digital
platforms (Zhong and Sastry, 2017). This highlights the pervasive and escalating
threat of phishing, demonstrating the substantial challenge it poses in both
personal and organizational security contexts. Since 2003, global agencies have been
working together to mitigate the impact of phishing URLs,


Table 1. Features selected by different feature selection algorithms.

S.No.  Feature                     Pearson  Chi-2  RFE    Logistics  Random Forest  LightGBM  Total
1      web_traffic                 TRUE     TRUE   TRUE   TRUE       TRUE           TRUE      6
2      having_sub_domain           TRUE     TRUE   TRUE   TRUE       TRUE           TRUE      6
3      having_IP                   TRUE     TRUE   TRUE   TRUE       TRUE           TRUE      6
4      URL_of_Anchor               TRUE     TRUE   TRUE   TRUE       TRUE           TRUE      6
5      SSLfinal_state              TRUE     TRUE   TRUE   TRUE       TRUE           TRUE      6
6      Links_in_tags               TRUE     TRUE   TRUE   TRUE       TRUE           TRUE      6
7      Google_Index                TRUE     TRUE   TRUE   TRUE       TRUE           TRUE      6
8      SFH                         TRUE     TRUE   TRUE   TRUE       TRUE           FALSE     5
9      Prefix_Suffix               TRUE     TRUE   TRUE   TRUE       TRUE           FALSE     5
10     Shortining_Service          TRUE     TRUE   TRUE   TRUE       FALSE          FALSE     4
11     Request_URL                 TRUE     TRUE   TRUE   FALSE      TRUE           FALSE     4
12     Redirect                    FALSE    TRUE   TRUE   TRUE       FALSE          TRUE      4
13     Links-pointing_to_page      FALSE    FALSE  TRUE   TRUE       TRUE           TRUE      4
14     DNS_Record                  TRUE     TRUE   TRUE   FALSE      FALSE          TRUE      4
15     having_At_symbol            TRUE     TRUE   TRUE   FALSE      FALSE          FALSE     3
16     age_of_domain               TRUE     TRUE   FALSE  FALSE      TRUE           FALSE     3
17     statistical_report          TRUE     TRUE   TRUE   FALSE      FALSE          FALSE     3
18     Page_Rank                   TRUE     TRUE   FALSE  FALSE      FALSE          TRUE      3
19     Domain_registration_length  TRUE     TRUE   FALSE  FALSE      TRUE           FALSE     3
20     URL_Length                  TRUE     TRUE   FALSE  FALSE      FALSE          FALSE     2
21     Abnormal_URL                TRUE     TRUE   FALSE  FALSE      FALSE          FALSE     2
22     Port                        FALSE    FALSE  TRUE   FALSE      FALSE          FALSE     1
23     on_mouseover                TRUE     FALSE  FALSE  FALSE      FALSE          FALSE     1
24     Submitting_to_email         FALSE    FALSE  TRUE   FALSE      FALSE          FALSE     1
25     Iframe                      FALSE    FALSE  TRUE   FALSE      FALSE          FALSE     1
26     HTTPS_token                 FALSE    FALSE  TRUE   FALSE      FALSE          FALSE     1
27     popUpWindow                 FALSE    FALSE  FALSE  FALSE      FALSE          FALSE     0
28     double_slash_redirect       FALSE    FALSE  FALSE  FALSE      FALSE          FALSE     0
29     Right_Click                 FALSE    FALSE  FALSE  FALSE      FALSE          FALSE     0
30     Favicon                     FALSE    FALSE  FALSE  FALSE      FALSE          FALSE     0
Figure 1. Phishing URL distribution between 2020 and 2022.

emphasizing the need for literacy and education in
combating such cyber threats (Awasthi and Goel, 2021a).


Professionals and academic research play critical roles in these preventive
measures. The aim of this study is to enhance the accuracy of phishing website
detection (Awasthi and Goel, 2021b). This involves a two-phase process where the
first phase uses eight machine learning classifiers to analyze the dataset. The
second phase employs six feature selection algorithms along with machine learning
classifiers to refine the data representation. The objective here is to identify key
features that improve the accuracy of classifiers while using a reduced set of
features. The study provides detailed tables listing the selected features of each
algorithm, along with their descriptions and abbreviations, to illustrate the
methodology and results systematically.

Table 2. Names of Features and their Abbreviations.

Feature Name                Abbreviation
web_traffic                 f_1
having_sub_domain           f_2
having_IP                   f_3
URL_of_Anchor               f_4
SSLfinal_state              f_5
Links_in_tags               f_6
Google_Index                f_7
SFH                         f_8
Prefix_Suffix               f_9
Shortining_Service          f_10
Request_URL                 f_11
Redirect                    f_12
Links-pointing_to_page      f_13
DNS_Record                  f_14
having_At_symbol            f_15
age_of_domain               f_16
statistical_report          f_17
Page_Rank                   f_18
Domain_registration_length  f_19
URL_Length                  f_20
Abnormal_URL                f_21
Port                        f_22
on_mouseover                f_23
Submitting_to_email         f_24
Iframe                      f_25
HTTPS_token                 f_26
popUpWindow                 f_27
double_slash_redirect       f_28
Right_Click                 f_29
Favicon                     f_30

Table 3. Number of Features Selected by Feature Selection Algorithms.

S.No.                     Pearson  Chi-2  RFE   Logistics  Random Forest  LightGBM
1                         f_1      f_1    f_1   f_1        f_1            f_1
2                         f_2      f_2    f_2   f_2        f_2            f_2
3                         f_3      f_3    f_3   f_3        f_3            f_3
4                         f_4      f_4    f_4   f_4        f_4            f_4
5                         f_5      f_5    f_5   f_5        f_5            f_5
6                         f_6      f_6    f_6   f_6        f_6            f_6
7                         f_7      f_7    f_7   f_7        f_7            f_7
8                         f_8      f_8    f_8   f_8        f_8            f_12
9                         f_9      f_9    f_9   f_9        f_9            f_13
10                        f_10     f_10   f_10  f_10       f_11           f_14
11                        f_11     f_11   f_11  f_12       f_13           f_18
12                        f_14     f_12   f_12  f_13       f_16           -
13                        f_15     f_14   f_13  -          f_19           -
14                        f_16     f_15   f_14  -          -              -
15                        f_17     f_16   f_15  -          -              -
16                        f_18     f_17   f_17  -          -              -
17                        f_19     f_18   f_22  -          -              -
18                        f_20     f_19   f_24  -          -              -
19                        f_21     f_20   f_25  -          -              -
20                        f_23     f_21   f_26  -          -              -
No. of features selected  20       20     20    12         13             11

The article continues by reviewing previous methods for detecting phishing websites
in Section 2. Section 3 provides a detailed overview of the experimental setup and
its rationale. Information about the dataset and its attributes is presented in
Section 4. The results of the experiment are analyzed in Section 5, while Sections 6
and 7 conclude the study and discuss its implications.
Literature Review

This section explores cutting-edge machine-learning techniques for detecting
phishing websites. Early methods involved constructing basic feature sets from URL
word lists using the bag-of-words approach (Le et al., 2011). Feng et al. (2018)
introduced a more advanced method by proposing a novel neural network optimized for
phishing detection through risk minimization principles, enhancing the model's
ability to generalize across different scenarios. They tested their model on a
substantial dataset from the UCI repository, consisting of 11,055 samples labeled as
either legitimate or phishing and featuring 30 different attributes per website,
encompassing domains, exceptions, HTML, JavaScript, and address bar elements.
Further advancing the field, Muhammad et al. (2012) focused on the systematic
extraction of URL features and the development of hierarchical classifiers. Their
method emphasizes the automation of phishing detection, highlighting that although
incorporating third-party service features may slow down the process, it
significantly improves the accuracy of detection (Sahingoz et al., 2019). This
approach underscores a pivotal shift towards more precise and automated methods in
phishing website classification. The research examined the effectiveness of a
proposed algorithm by testing it on 1,407 legitimate and 2,119 phishing websites
from the Alexa database and PhishTank, respectively. This work highlighted the
constraints of traditional rule-based feature selection and modeling, particularly
in generalizing to previously unseen URLs, prompting a shift towards deep
learning-based phishing detection (Iuga et al., 2016). Deep learning, known for its
ability to model complex functions using large datasets, automates feature selection
(Zhao et al., 2018; Singh and Singh, 2023; Banerjee et al., 2023) using word-level
features and techniques like recurrent neural networks (Bahnsen et al., 2017).

Muhammad et al. (2014) further advanced this field by developing a novel
self-structured neural network (NN) specifically for identifying phishing websites.
They evaluated this network using 17 distinctive signatures derived from 800
phishing and 600 legitimate websites sourced from PhishTank and Millersmiles,
incorporating some data from third-party services. Their studies demonstrated the
robustness and adaptability of neural networks in detecting phishing. They explored
a backpropagation-trained feedforward neural network (Mohammad et al., 2013; Dawn et
al., 2023) for further classifying websites. A significant advancement in phishing
detection involved focusing on character-level features from URLs, recognizing that
language and sentiment can be discerned from character sequences (Zhang et al.,
2015). This shift towards character-level analysis reduces the need for extensive
feature selection or preprocessing, allowing researchers to optimize computational
efficiency and structural design in deep learning models.

Jain and Gupta (2018) proposed a machine learning-based method to detect phishing
websites, focusing exclusively on client-side features. They utilized 19 specific
features derived from URLs and source code to assess their approach. The evaluation
involved testing on 2,141 phishing pages from PhishTank and OpenPhish, alongside
1,918 legitimate pages from the Alexa database and several online payment and
banking websites (Jain and Gupta, 2018). A key part of their study was examining the
impact of data augmentation on phishing URL detection performance through the use of
generative adversarial networks (GANs).

Despite the breadth of features used in various studies 
for phishing detection, it has been noted that some 
features may not be adequate for reliably identifying 
phishing attempts (Anand et al., 2018). The selection of 
the most effective features has not been a primary focus 
in the field. To address this, Rajab advocated for the use 
of correlated feature sets and information gain to enhance 
phishing site identification (Rajab, 2018). In an analysis 
using the UCI repository, information gain and 
correlation-based feature selection methods were used to 
identify the most relevant features—11 and 9 features 
were selected out of 30, respectively, across 11,055 
samples. The effectiveness of these selected features was 
further validated using the data mining algorithm 
RIPPER, showcasing a methodical approach to refine 
phishing detection through strategic feature selection. Bu 
and Cho employed an unsupervised learning method to 
tackle phishing attacks, uncovering significant class 
imbalances in the classification of phishing URLs (Bu 
and Cho, 2021). In a similar vein, Babagoli et al. utilized 
a comparable dataset and recommended the use of 
decision trees and wrapper methods for feature selection, 
ultimately selecting 20 features (Le et al., 2018). They 
further enhanced their approach with a novel
metaheuristic-based nonlinear regression technique to 
evaluate phishing site performance (Babagoli et al., 
2018). 

However, these feature selection methods depend 
heavily on the underlying data and require the setting of 
user-specified thresholds, which can significantly
influence the final performance of the classification 
algorithms, especially when features are selected from 
data not seen during the training phase. 
Figure 2. Experiment's flow diagram.


In a notable advancement, Microsoft developed a deep 
learning model that enhances phishing attack detection by 
integrating both character-level and word-level features 
(Tajaddodianfar et al., 2020). This model employs deep 
learning techniques like the self-attention mechanism to 
refine the URL feature set, making it one of the most 
accurate and reliable phishing detection methods 
currently available. Further improving upon this, Bu and 
Cho have optimized performance by using expert 
knowledge-based feature sets along with character- and 
word-level URL features (Bu and Cho, 2021). 
Additionally, first-order logic-based rules have been 
employed to correct outputs from deep learning 
classifiers, highlighting the ongoing efforts to refine the 
feature set for phishing detection. The integration of deep 
learning with traditional machine learning algorithms and 
genetic algorithms has also been explored for enhancing 
performance (Pal et al., 2023; Kumar et al., 2023; Yadav 
and Singh, 2023; Jain et al., 2024). For instance, 
Suleiman et al. (2019) have boosted the accuracy of 
various classifiers, including Naive Bayes (NB), k- 
Nearest Neighbors (k-NN), Decision Trees (DT), and 
Random Forests (RF), by incorporating evolutionary 
computation-based feature selection algorithms into these 
traditional machine learning frameworks. Similarly, Park 
et al. (2021) have leveraged genetic algorithms to 
improve the discovery rules, thereby increasing both the 
precision and recall of deep learning classifiers and 
enhancing overall detection performance. These 
developments underscore a multi-faceted approach to 
improving phishing website detection through innovative 
combinations of machine learning techniques. 
Experimental Methodology

In this section, the focus is on the experimental setup 
where various machine learning classifiers were 
evaluated both before and after implementing a feature 
selection process (Cai et al., 2017). Six different feature 
selection methods were utilized, each determining the 
optimal number of features to use for enhancing classifier 
performance. The architecture of the proposed method is 
illustrated comprehensively in Figure 2, which depicts the 
comparative results obtained from the classifiers with and 
without feature selection, as well as the specific number 
of features selected by each feature selection algorithm. 
This section provides a concise overview of all the 
machine learning classifiers and feature selection 
algorithms used in the experiment. This description aims 
to offer a clear understanding of how each classifier and 
feature selection method contributes to the overall 
effectiveness of the phishing detection process, 
highlighting the improvements in performance achieved 
through the strategic reduction of features. 
Machine Learning Classifiers 

Machine learning classifiers are algorithms that assign data to predetermined
groups or categories according to its input characteristics (Awasthi and Goel,
2021c). Support Vector Machine (SVM) is a common classifier that finds the best
hyperplane to divide classes; Logistic Regression (LR) models the probability of
class membership using the logistic function; Random Forest (RF) constructs
multiple decision trees and combines their outputs for improved accuracy; AdaBoost
combines weak classifiers to form a strong classifier; Decision Tree (DT) divides
data into branches based on feature values; K-Nearest Neighbors (K-NN) classifies
based on the majority class among the nearest neighbors; and Gradient Boosting
Classifier (GBC) builds models sequentially to correct the errors of the previous
ones.
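
The classifier families listed above could be instantiated as in the following minimal sketch. It assumes scikit-learn, which the study does not name as its toolkit, and every hyperparameter shown (for example `max_iter`, `n_neighbors`, `random_state`) is an illustrative assumption rather than a setting reported by the authors. SVM is listed with two kernels so that the count reaches the eight classifiers reported in Table 4.

```python
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical configuration: the paper does not report hyperparameters.
classifiers = {
    "SVM (kernel='linear')": SVC(kernel="linear"),
    "SVM (kernel='rbf')": SVC(kernel="rbf"),
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "GBC": GradientBoostingClassifier(random_state=0),
}
```

Keeping the models in one dictionary makes it straightforward to loop over them with the same training and test data.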

Support vector machine (SVM) 

A Support Vector Machine (SVM) is an effective supervised learning approach for
classification and regression problems. It operates by determining the hyperplane
in the feature space that best separates the data points of different classes
(Taher et al., 2018). By using kernel functions such as linear, polynomial, and
radial basis function (RBF) kernels to map the input space into higher dimensions
where a linear separator is more effective, SVM can handle both linearly and
non-linearly separable data. The aim is to maximize the margin, that is, the
distance between the hyperplane and the closest data points from each class (the
support vectors). Maximizing the margin improves the model's ability to generalize
to new, unseen data. SVM works very well in high-dimensional spaces.
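
The effect of the kernel choice can be illustrated with a small sketch. It assumes scikit-learn and uses the synthetic `make_circles` dataset, which is not part of the study; the sample size and noise level are arbitrary illustrative values.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic, non-linearly separable data (illustrative only, not the phishing dataset).
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# A linear kernel searches for a straight separating line in the input space.
linear_svm = SVC(kernel="linear").fit(X, y)
# The RBF kernel implicitly maps the points to a higher-dimensional space.
rbf_svm = SVC(kernel="rbf").fit(X, y)

linear_acc = linear_svm.score(X, y)
rbf_acc = rbf_svm.score(X, y)
print(f"linear kernel accuracy: {linear_acc:.3f}")
print(f"rbf kernel accuracy:    {rbf_acc:.3f}")
```

On concentric classes like these, no straight line separates the data, so the RBF kernel is expected to fit far better than the linear one.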

Logistic Regression (LR) 

Logistic regression (LR) is a popular statistical technique for binary
classification problems. In contrast to linear regression, which predicts
continuous outcomes, logistic regression models the probability that an input
belongs to a certain class (Thabtah et al., 2019). This is achieved with the
logistic function, also referred to as the sigmoid function, which maps predicted
values to a probability between 0 and 1. By using maximum likelihood estimation to
estimate the coefficient of each input feature, the model is able to ascertain how
each feature affects the result. Efficiency, interpretability, and simplicity are
the main benefits of logistic regression, particularly when there is a linear
relationship between the features and the log-odds of the target variable
(Josephine et al., 2021). It is often used for predicting binary outcomes such as
spam identification and disease presence.
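
The mapping from a linear score to a probability can be sketched in a few lines of plain Python. The weights, bias, and feature values below are hypothetical numbers chosen for illustration, not coefficients estimated in the study.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical coefficients and a hypothetical two-feature sample.
weights = [1.2, -0.8]
bias = 0.1
p = predict_proba([1, -1], weights, bias)

# The log-odds of p recovers the linear score bias + w . x.
log_odds = math.log(p / (1 - p))
```

The last line shows the property the paragraph relies on: the model is linear in the log-odds even though it is non-linear in the probability.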

Random Forest (RF) 

Random Forest (RF) is an ensemble learning technique used for classification and
regression problems. It generates many decision trees during training and outputs
the mode of the classes (for classification) or the mean prediction (for
regression) of the individual trees. It combines the ideas of random feature
selection, which selects a random subset of features for splitting at each node in
the tree, and bagging (Sun et al., 2017), which creates several subsets of the
dataset by random sampling with replacement. By averaging out the errors of
individual trees, this method enhances the model's generalization and decreases
overfitting. Because every tree in the forest is trained independently, the
combined result is a prediction that is more reliable and accurate. Random Forest
is renowned for its high accuracy and its capacity to handle large datasets.

AdaBoost 

Adaptive Boosting, or AdaBoost, is an ensemble learning technique that builds a
strong classifier by aggregating the results of many weak classifiers. It operates
by sequentially training weak classifiers, which are usually shallow decision
trees, on the dataset (Bansal et al., 2022). Every classifier concentrates on the
examples that the preceding ones misclassified. Each training instance receives a
weight throughout this procedure, and the weight of misclassified examples is
raised so that later classifiers give them more consideration. The final model is
the weighted sum of the weak classifiers. Through this iterative approach, each
classifier focuses on the challenging cases, producing a strong final model that
combines the strengths of all the weak ones and increases accuracy.
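
One boosting round can be written out as a worked example. The sketch below follows the standard discrete AdaBoost update with labels in {-1, +1}; the study does not state which AdaBoost variant it used, and the five toy labels are invented for illustration.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """Return the classifier weight (alpha) and the re-normalized sample weights."""
    # Weighted error of the weak classifier on the current sample weights.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified samples are scaled up, correct ones are scaled down.
    updated = [
        w * math.exp(alpha if t != p else -alpha)
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Toy example: five samples, the weak classifier gets the second one wrong.
y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]
weights = [0.2] * 5

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
```

After this round the single misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right.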

Decision Tree (DT) 

Decision trees are a popular method in supervised 
machine learning, where they are used to model decisions 
and their possible consequences, similar to a flowchart. 
This algorithm splits the data into branches at decision 
nodes, which represent tests on certain attributes. Each 
split is based on the attribute that results in the most 
distinct separation of the data into groups based on the 
target variable (Gøttcke et al., 2021). The structure of a
decision tree includes two main elements: decision nodes 
and leaves. Decision nodes are the points where the data 
is split. Each decision node represents a question based 
on an attribute, with the branches from the node 
answering this question. Leaves, on the other hand, 
represent the final outcomes or decisions. 
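
How a decision node picks "the most distinct separation" can be sketched with Gini impurity, which is one common splitting criterion; the text does not say which criterion the study used, so this choice is an assumption. The four toy rows are invented and use the -1/1 feature coding only for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(rows, labels, feature_index, value):
    """Weighted impurity after testing 'feature == value' at a decision node."""
    left = [l for r, l in zip(rows, labels) if r[feature_index] == value]
    right = [l for r, l in zip(rows, labels) if r[feature_index] != value]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_split(rows, labels):
    """Try every (feature, value) test and keep the one with the lowest impurity."""
    candidates = sorted({(i, v) for r in rows for i, v in enumerate(r)})
    return min(candidates, key=lambda c: split_gini(rows, labels, c[0], c[1]))


# Toy data: feature 0 determines the label, feature 1 is noise.
rows = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
labels = [1, 1, 0, 0]
feature_index, value = best_split(rows, labels)
```

A full tree repeats this search recursively on each branch until the leaves are pure or a stopping rule is met.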
k-Nearest Neighbor (k-NN) 

The k-nearest neighbor (k-NN) classifier is a nonparametric supervised machine
learning method. It relies on distance: it classifies objects according to the
classes of their closest neighbors. The most common application of k-NN is
classification, but it can also be used to solve regression problems. As in any
supervised model, the labels in the training set serve as a guide for learning
(Tekouabou et al., 2020). Unlike simple models such as linear regression, it is
suitable for data in which the relationship between the independent and dependent
variables is non-linear.
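
The distance-and-vote procedure is short enough to write from scratch. This is a minimal sketch using Euclidean distance and k = 3, both of which are assumptions; the training points are invented for illustration.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify 'query' by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy two-feature training set (illustrative values only).
train_X = [(1, 1), (1, 0.8), (0.9, 1), (-1, -1), (-0.8, -1), (-1, -0.9)]
train_y = ["phishing", "phishing", "phishing",
           "legitimate", "legitimate", "legitimate"]

prediction = knn_predict(train_X, train_y, (0.7, 0.9), k=3)
```

Because there is no training step beyond storing the data, all of the cost is paid at prediction time, when distances to every stored point are computed.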


Gradient boosting classifier (GBC) 

In Gradient Boosting, each successive predictor aims to improve upon its
predecessor by reducing the prediction error. Unlike traditional methods where a
predictor is fit directly to the data, Gradient Boosting fits each new predictor to
the residual errors made by the previous predictors (Awasthi and Goel, 2022). This
iterative process begins with an initial prediction based on the dataset, often
calculated as the log-odds of the target class, that is, the logarithm of the
number of positive outcomes divided by the number of negative outcomes. Each new
predictor then focuses on correcting the mistakes of the preceding model, refining
the overall prediction accuracy with each iteration. This strategy of incrementally
correcting errors makes Gradient Boosting a powerful technique for building highly
accurate predictive models.
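
The initial prediction and the first set of residuals can be worked through directly. The sketch below uses the class counts reported for this dataset (6157 phishing and 4898 legitimate websites) and assumes phishing is the positive class coded as 1; the variable names are illustrative.

```python
import math

# Class counts as reported in the Experimental Setup section.
n_phishing = 6157     # positive class (label 1)
n_legitimate = 4898   # negative class (label 0)

# Initial prediction: log-odds of the positive class.
initial_log_odds = math.log(n_phishing / n_legitimate)

# Converting the log-odds back to a probability gives the positive-class share.
initial_probability = 1.0 / (1.0 + math.exp(-initial_log_odds))

# Residuals that the first tree is fit to: observed label minus predicted probability.
residual_phishing = 1 - initial_probability
residual_legitimate = 0 - initial_probability
```

Every sample starts with the same prediction, so the first tree only has to learn which samples deserve a higher or lower score than this baseline.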
Feature Selection Algorithms 

Feature selection in machine learning is the process of removing features that are
noisy, redundant, or unnecessary in order to find the most relevant subset of the
original set. This procedure is essential for enhancing classifier accuracy since
it concentrates on the most important features. Six different feature selection
techniques were used in this research to determine which features were most
important and relevant. By removing less valuable data, these techniques sought to
improve the classifier's accuracy (Abdul Khalek et al., 2019). Table 1 provides a
comprehensive list of all the features selected using these various feature
selection techniques. By applying these methods, the study seeks to streamline the
dataset, thereby optimizing the performance of the machine learning models used for
classification.
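
The Total column of Table 1 is a simple vote count across the six methods, which can be sketched as follows. The four rows are taken from Table 1; the column order (Pearson, Chi-2, RFE, Logistics, Random Forest, LightGBM) is an assumption inferred from Table 3, since part of the Table 1 header is garbled in the source.

```python
# Assumed column order of Table 1, inferred from Table 3.
METHODS = ["Pearson", "Chi-2", "RFE", "Logistics", "Random Forest", "LightGBM"]

# Selection flags for four features, copied from Table 1.
votes = {
    "web_traffic": [True, True, True, True, True, True],
    "Request_URL": [True, True, True, False, True, False],
    "Redirect":    [False, True, True, True, False, True],
    "Favicon":     [False, False, False, False, False, False],
}

# Total = number of methods that selected the feature.
totals = {name: sum(flags) for name, flags in votes.items()}

# Rank features by how many methods agree on them.
ranked = sorted(totals, key=totals.get, reverse=True)
```

Features with a high total are selected regardless of the method, while a total of 0 marks features that no method considered useful.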

Pearson correlation 

Pearson Correlation creates a matrix measuring the linear association between
features, providing values from -1 to 1. It evaluates the relationship between each
feature and the target variable to identify the features with the greatest impact
on the target (Ali et al., 2019).

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

Where n is the number of records in the dataset, $\bar{x}$ is the average value of
the sample attribute, $x_i$ is the i-th value of the variable, and y is the target
variable (with $y_i$ its i-th value and $\bar{y}$ its average). A value of 1
indicates a perfect positive correlation, -1 indicates a perfect negative
correlation, and 0 indicates no correlation.
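
The coefficient can be computed and used to rank features as in this minimal sketch. The two toy features and the target are invented for illustration, and ranking by absolute correlation is an assumption about how the scores are turned into a selection.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant feature has zero variance and would divide by zero here.
    return cov / math.sqrt(var_x * var_y)


# Toy data: f_a follows the target exactly, f_b only loosely.
features = {
    "f_a": [1, 1, -1, -1, 1, -1],
    "f_b": [1, -1, 1, -1, 1, -1],
}
target = [1, 1, 0, 0, 1, 0]

scores = {name: pearson_r(values, target) for name, values in features.items()}
ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
```

Keeping the top-ranked names from `ranked` yields a Pearson-based feature subset.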
Chi-2 

The chi-2 test is used to verify the independence of attributes in statistical
models (Li et al., 2022). It measures the difference between expected and actual
responses. A lower Chi-2 value indicates that the variables are less dependent on
one another, while a higher value indicates a greater dependence. The null
hypothesis is the initial assumption that the attributes are independent of one
another. The following formula is used to determine the value of the expected
result:

$$E_i = P(x_i \cap y_i) = P(x_i) \times P(y_i)$$

The following expression can be used to calculate the chi-square:

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

Where,
i ranges from 1 to n,
n is the number of dataset records,
$O_i$ is the actual outcome,
$E_i$ is the expected outcome.
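
The statistic can be worked through on a small contingency table. In this sketch the sum runs over the cells of a feature-by-class table, with expected counts derived from the row and column totals under independence; tabulating the data this way is an assumption, and the counts are invented.

```python
def chi_square(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count under independence: N * P(row i) * P(column j).
            e = row_totals[i] * col_totals[j] / n
            statistic += (o - e) ** 2 / e
    return statistic


# Rows: feature value (present / absent); columns: class (phishing / legitimate).
dependent_table = [[30, 10], [10, 30]]
independent_table = [[20, 20], [20, 20]]

dependent_score = chi_square(dependent_table)
independent_score = chi_square(independent_table)
```

A feature whose table looks like `dependent_table` scores high and is kept, while one that looks like `independent_table` scores zero and carries no information about the class.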
Recursive feature elimination (RFE) 

Basic feature selection methods focus primarily on the individual properties of
features and how they interact with one another. Examples include variance
thresholding and pairwise feature selection, which remove unnecessary features
based on their variance and the correlation between them. However, a more practical
strategy chooses features based on how they affect the performance of a particular
model. Recursive Feature Elimination (RFE) is such a method: it reduces model
complexity by keeping the most important features and removing the weaker ones
(Chen et al., 2018). The procedure eliminates the least important features a few at
a time, refitting the model on each loop, until the desired number of features is
left. Removing features recursively in this way also eliminates dependencies and
collinearity in the model. The reduction in the number of features achieved by RFE
results in an increase in model efficiency.
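
The elimination loop is available in scikit-learn as `RFE`, shown here as a minimal sketch. The synthetic data, the logistic regression base estimator, and `step=1` are assumptions; only the target of 20 retained features mirrors the RFE column of Table 3.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in with 30 features (not the phishing dataset).
X, y = make_classification(
    n_samples=500, n_features=30, n_informative=8, n_redundant=4, random_state=0
)

# Drop one feature per iteration until 20 remain.
selector = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=20,
    step=1,
)
selector.fit(X, y)

# support_ is a boolean mask; ranking_ is 1 for every retained feature.
selected = [f"f_{i + 1}" for i, keep in enumerate(selector.support_) if keep]
```

Features eliminated earlier receive a higher value in `selector.ranking_`, so the attribute also gives an ordering of the discarded features.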

Logistic Regression (LR) 

Logistic regression establishes a relationship between predictor variables and the
probability of an outcome using the sigmoid function instead of a linear function
as in linear regression (Alsouda et al., 2019). This makes it suitable for binary
classification tasks, where it models the probability of a particular class.

$$\log\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + \beta_1 x$$

Where $\frac{p(x)}{1 - p(x)}$ is the odds term and
$\log\left(\frac{p(x)}{1 - p(x)}\right)$ is the logit or log-odds function.
Random Forest (RF) 


Random Forest is a supervised model that employs both decision trees and bagging
(Awasthi and Goel, 2022). The idea is to resample the training dataset using a
technique called "bootstrap" and then fit a decision tree to each sample, using a
random subset of the original columns. Each tree in the Random Forest can determine
the importance of a feature from its ability to increase the purity of the leaves:
the more a feature increases leaf purity, the more important it is. This is
computed for each tree, averaged over all trees, and then normalized to 1 at the
end. As a result, the random forest's importance scores all add up to 1.
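
The normalized importance scores can be inspected with scikit-learn as in the sketch below. The synthetic data and the number of trees are assumptions, and keeping the top 13 features is only one possible cut-off, chosen here to match the Random Forest count in Table 3; the study does not state its selection rule.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with 30 features (not the phishing dataset).
X, y = make_classification(
    n_samples=500, n_features=30, n_informative=8, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, averaged over trees and normalized to sum to 1.
importances = forest.feature_importances_

# Rank features f_1 ... f_30 by importance and keep the top 13 (assumed cut-off).
ranked = sorted(enumerate(importances, start=1), key=lambda pair: pair[1], reverse=True)
top_13 = [f"f_{index}" for index, _ in ranked[:13]]
```

Because the scores are relative shares, they can be compared across features within one model but not directly across different datasets.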

LightGBM 

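LightGBM is a gradient boosting framework that makes use of tree-based learning
algorithms (Rufo et al., 2021). LightGBM grows trees vertically (leaf-wise),
whereas other algorithms grow them horizontally (level-wise). As a result, LightGBM
expands one leaf at a time, choosing the leaf that gives the largest reduction in
loss, rather than building the tree one layer at a time.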


Experimental Setup 

The dataset used comes from the Phishing website dataset in the Kaggle repository
(Phishing website dataset | Kaggle, https://www.kaggle.com/datasets). The phishing
dataset has 32 features; the feature named Index was removed because it only
contains serial numbers. Table 2 shows that of the remaining 31 features, 30 are
independent and 1 is dependent. The final feature, Result, indicates whether the
website is phishing (1) or legitimate (0). As depicted in Figure 3, there are 4898
legitimate websites and 6157 phishing websites.


Figure 3. Phishing and legitimate websites. (Plot title: Phishing Detection; x-axis: Phishing/Non_Phishing.)


Results 
The findings in this manuscript are based on analyses
performed both prior to and following feature selection.
By comparing these results, we can evaluate if utilizing a
reduced set of features leads to better performance
compared to using the full feature set. We begin by
examining the outcomes derived from all features (f1, f2,
..., f30), as detailed in Table 3. These results encompass
evaluations of accuracy, recall, precision, F1-score, and
confusion matrices, along with the feature correlation
matrix and ROC curve analysis. These thorough
evaluations serve as the cornerstone of our conclusions.
Results before feature selection 

A correlation matrix was first constructed to examine 
the relationships between the coefficients of various 
variables (Qiu et al., 2021). This matrix summarizes the 
phishing dataset and helps identify and visualize patterns 
within the data. It illustrates the correlation between all 
31 pairs of feature values in a tabular format, with 
variables displayed in rows and columns. The correlation 
coefficient for each pair can be found in the 
corresponding cell of the table. Additionally, the 
correlation matrix is often used alongside other types of 
statistical analysis. 

Figure 4 demonstrates that the ranks of the 12
features (f5, f4, f1, f9, f2, f11, f6, f19, f8, f7, f16, and
f18) are highly correlated. In the next step, we applied
various machine learning classifiers to our dataset with
all features. Table 4 presents the results of several
experiments involving machine learning-based
classification of the dataset's features. For the evaluation
and comparison of the learning algorithms, the dataset
was divided into two parts: 80% was used for training
and 20% for testing. To ensure the robustness of our
evaluation, K-fold cross-validation was employed to
validate the dataset. This method allowed for a
comprehensive assessment of each algorithm's
performance by repeatedly training and testing on
different subsets of the data, thus minimizing the
potential for bias and improving the reliability of our
results. After training, the classifiers were tested on the
held-out data to distinguish between phishing and
non-phishing website URLs. All eight machine learning
classifiers performed well on the dataset. This initial
experiment, conducted before feature selection, aimed to
obtain results from straightforward classification. The RF
and DT classifiers tied for the highest training accuracy
(99.06%); on the test dataset, RF achieved the highest
accuracy (96.74%), followed by DT (95.97%). Table 4
reports the training and testing outcomes across the
classifiers.
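
The splitting procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the sample count of 11,055 is inferred from the confusion matrices in Table 6 (8,844 training plus 2,211 test samples), while the number of folds (k = 5), the shuffling seed, and the function names are hypothetical, since the source does not report them.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle the sample indices and split them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists; each sample is validated exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# 80% / 20% split, followed by k-fold cross-validation on the training part.
train_idx, test_idx = train_test_split_indices(11055)
splits = list(k_fold_indices(train_idx, k=5))
```

Keeping the 20% test portion outside the cross-validation loop means the folds are used only to validate the model, and the test accuracy is measured on data that was never used for training.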
Figure 4. Correlation matrix.


Table 4. Accuracy (Train and Test) of the classifiers with all features.

      | SVM (kernel='linear') | SVM (kernel='rbf') | LR     | RF     | AdaBoost | DT     | K-NN   | GBC
Train | 92.84%                | 95.41%             | 92.94% | 99.06% | 93.96%   | 99.06% | 96.55% | 95.28%
Test  | 92.85%                | 94.71%             | 92.40% | 96.74% | 93.58%   | 95.97% | 94.08% | 95.07%
Figure 5 depicts the corresponding outcome. To
evaluate Recall, Precision, Specificity, Accuracy, and 
AUC-ROC (Area Under the Receiver Operating 
Characteristic Curve), the confusion matrix was 
employed. The confusion matrix is a table that displays 
the four possible outcomes of predicted and actual values: 
True Positives (TP), False Positives (FP), True Negatives 
(TN), and False Negatives (FN). This matrix is crucial for 
calculating various performance metrics that provide a 
comprehensive understanding of the model's 
effectiveness. 
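
To make the link between the four counts and the metrics concrete, the sketch below derives them for the RF test confusion matrix reported in Table 6. Reading that matrix as [[TN, FP], [FN, TP]] is an assumption about its layout; it is the reading under which the computed values agree with the precision, recall, and F1-score reported for the RF classifier.

```python
def binary_metrics(tn, fp, fn, tp):
    """Derive the validation measures from the four confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

# RF test confusion matrix from Table 6, read as [[TN, FP], [FN, TP]] (assumed layout).
rf_test = binary_metrics(tn=909, fp=47, fn=25, tp=1230)
for name, value in rf_test.items():
    print(f"{name}: {100 * value:.2f}%")
```

Under this reading, the computed accuracy, precision, recall, and F1-score match the reported 96.74%, 96.31%, 98.00%, and 97.15% to within about 0.01 percentage points, the small differences being consistent with truncation instead of rounding.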


Table 6 presents the precision, recall, and F1 scores for 
phishing and legitimate URLs across both the training 
and testing datasets. Additionally, the confusion matrix 
scores have been extracted for these datasets. On the test 
dataset, the RF classifier stands out with a precision of 
96.31%, a recall of 98.00%, and an F1-score of 97.15%.
The validation score for the RF classifier is also very 
similar to that of the DT classifier. Furthermore, the 
confusion matrix-based results are closely aligned with 
those of the DT classifier. 



Figure 5. Visualization of classifier accuracy across all features during training and testing. 


Table 5. Metrics for validation used in the experiment.

Validation measure | Formula
Precision          | $\frac{\text{True Positive (TP)}}{\text{True Positive (TP)} + \text{False Positive (FP)}}$
Recall             | $\frac{\text{True Positive (TP)}}{\text{True Positive (TP)} + \text{False Negative (FN)}}$
F1-score           | $2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$


Confusion Matrix

PREDICTED VALUES \ ACTUAL VALUES | Actual: phishing | Actual: legitimate
Predicted: phishing   | Phishing URL correctly predicted as phishing by the model     | Legitimate URL incorrectly predicted as phishing by the model
Predicted: legitimate | Phishing URL incorrectly predicted as legitimate by the model | Legitimate URL correctly predicted as legitimate by the model
Table 6. Performance Metrics (Train and Test) of the classifiers with all features.

Columns: for each of precision, recall, and F1-score, the values are given in % for Train (class 0, class 1) and Test (class 0, class 1), followed by the Train and Test confusion matrices.

Classifiers | Precision Train 0 | Precision Train 1 | Precision Test 0 | Precision Test 1 | Recall Train 0 | Recall Train 1 | Recall Test 0 | Recall Test 1 | F1 Train 0 | F1 Train 1 | F1 Test 0 | F1 Test 1 | Confusion Matrix (Train) | Confusion Matrix (Test)
SVM (kernel='linear') | 93.11 | 92.63 | 93.18 | 92.61 | 92.63 | 94.61 | 90.06 | 94.98 | 91.86 | 93.61 | 91.59 | 93.78 | [[3573 369] [264 4638]] | [[861 95] [63 1192]]
SVM (kernel='rbf') | 96.11 | 94.87 | 95.35 | 94.24 | 93.48 | 96.96 | 92.25 | 96.57 | 94.77 | 95.90 | 93.77 | 95.39 | [[3685 257] [149 4753]] | [[882 74] [43 1212]]
LR | 93.20 | 92.74 | 91.91 | 92.76 | 90.79 | 94.67 | 90.37 | 93.94 | 91.98 | 93.70 | 91.13 | 93.34 | [[3579 363] [261 4641]] | [[864 92] [76 1179]]
RF | 99.25 | 98.90 | 97.32 | 96.31 | 98.63 | 99.40 | 95.08 | 98.00 | 98.94 | 99.15 | 96.19 | 97.15 | [[3888 54] [29 4873]] | [[909 47] [25 1230]]
AdaBoost | 94.37 | 93.64 | 93.57 | 93.57 | 91.93 | 95.59 | 91.42 | 95.21 | 93.13 | 94.60 | 92.48 | 94.39 | [[3624 318] [216 4686]] | [[874 82] [60 1195]]
DT | 99.00 | 99.10 | 95.48 | 96.34 | 98.88 | 99.20 | 95.18 | 96.57 | 98.94 | 99.15 | 95.33 | 96.45 | [[3898 44] [39 4863]] | [[910 46] [43 1212]]
K-NN | 96.64 | 96.48 | 93.74 | 94.32 | 95.58 | 97.32 | 92.46 | 95.29 | 96.11 | 96.90 | 93.10 | 94.80 | [[3768 174] [131 4771]] | [[884 72] [59 1196]]
GBC | 95.67 | 94.98 | 95.29 | 94.90 | 93.65 | 96.59 | 93.20 | 96.49 | 94.65 | 95.78 | 94.23 | 95.69 | [[3692 250] [167 4735]] | [[891 65] [44 1211]]


[Figure: ROC Curve Analysis, two panels plotted against False Positive Rate. Left panel: RandomForestClassifier, AUC = 0.995. Right panel: DecisionTreeClassifier, AUC = 0.974.]


Figure 6. RF and DT ROC (AUC) curves. 
Figure 7. Accuracy of the classifiers for various feature counts.


Table 7. Classifier accuracy (Train and Test) for various feature selections.

Columns: for each feature selection method, accuracy in % on Train and Test. Pearson, Chi-2, and RFE: Features = 20; Logistics: Features = 12; Random Forest: Features = 13; LightGBM: Features = 11.

Classifiers | Pearson Train | Pearson Test | Chi-2 Train | Chi-2 Test | RFE Train | RFE Test | Logistics Train | Logistics Test | Random Forest Train | Random Forest Test | LightGBM Train | LightGBM Test
SVM (kernel='linear') | 92.32 | 92.45 | 92.35 | 92.58 | 92.88 | 92.72 | 92.24 | 92.49 | 92.11 | 92.31 | 90.76 | 91.54
SVM (kernel='rbf') | 95.30 | 94.66 | 95.15 | 94.66 | 95.26 | 94.53 | 94.02 | 93.62 | 94.75 | 94.71 | 94.32 | 93.89
LR | 92.59 | 92.49 | 92.74 | 92.54 | 92.93 | 92.49 | 92.39 | 92.40 | 92.21 | 92.36 | 91.35 | 91.54
RF | 98.56 | 96.07 | 98.63 | 96.99 | 97.91 | 96.25 | 96.53 | 95.02 | 97.64 | 96.02 | 96.63 | 94.57
AdaBoost | 93.22 | 93.40 | 93.49 | 93.31 | 93.69 | 93.40 | 93.32 | 93.80 | 93.28 | 93.03 | 91.90 | 92.54
DT | 98.56 | 95.57 | 98.63 | 95.79 | 97.91 | 95.57 | 96.53 | 94.80 | 97.64 | 95.25 | 96.63 | 93.80
K-NN | 96.12 | 93.40 | 95.87 | 93.08 | 95.67 | 94.08 | 95.18 | 93.44 | 95.50 | 93.26 | 95.01 | 92.94
GBC | 94.43 | 94.62 | 94.60 | 94.53 | 94.78 | 94.62 | 94.19 | 94.21 | 94.46 | 94.35 | 94.05 | 93.62


The ROC (AUC) curve offers a comprehensive
measure of performance across all classification
thresholds. As demonstrated in the previous results,
metrics such as accuracy, precision, recall, the F1-score,
and the confusion matrix yield very similar scores for the
RF and DT classifiers, so an additional criterion is needed
to separate them. Figure 6 shows that the RF classifier
achieves a higher ROC (AUC) score (0.995) than the DT
classifier (0.974).
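
The reason AUC can separate two classifiers that look alike at a single threshold can be sketched with a small example. The function below computes AUC through its rank interpretation (the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one). The labels and both score lists are hypothetical and chosen only for illustration; they are not outputs of the study's models.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as half a win, which
    equals the area under the ROC curve computed with the trapezoidal rule.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [1, 1, 1, 0, 0, 0]
graded_scores = [0.9, 0.8, 0.6, 0.7, 0.2, 0.1]  # hypothetical probability-like scores
hard_scores = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0]    # hypothetical 0/1 outputs

auc_graded = roc_auc(labels, graded_scores)
auc_hard = roc_auc(labels, hard_scores)
```

At a threshold of 0.5 both score lists misclassify exactly one sample, so their accuracy is identical, yet the graded scores rank the samples better and obtain the higher AUC. This is the sense in which AUC adds information when accuracy-based metrics are nearly tied.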

Results after feature selection 

Feature selection algorithms have garnered significant
attention across a wide range of applications. These
algorithms simulate a "survival of the fittest" evolution to
search the solution space. Table 7 displays the scores
obtained by various feature selection algorithms for
different numbers of features from the simulation results.
Multiple scores are produced by the eight classifiers
based on their training and testing results (accuracy) on
fewer features. When comparing these scores, it is
evident that the Chi-2 feature selection algorithm
provided the RF classifier with the highest testing
accuracy, 96.99%, using 20 features. The RF classifier
achieved the second-highest score among the feature
selection methods (96.25%) using the 20 features selected
by RFE.
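
The idea behind Chi-2 feature ranking can be sketched as follows. This is a minimal illustration, assuming the classical chi-square test of independence on a feature-by-class contingency table; the source does not state which implementation the authors used, and library routines may compute the statistic differently. The feature names and toy values are hypothetical.

```python
from collections import Counter

def chi2_score(feature, labels):
    """Chi-square statistic of independence between a categorical feature and the class."""
    n = len(labels)
    observed = Counter(zip(feature, labels))
    feature_totals = Counter(feature)
    label_totals = Counter(labels)
    score = 0.0
    for f_value, f_count in feature_totals.items():
        for l_value, l_count in label_totals.items():
            expected = f_count * l_count / n
            score += (observed[(f_value, l_value)] - expected) ** 2 / expected
    return score

def select_k_best(columns, labels, k):
    """Return the names of the k features with the largest chi-square scores."""
    ranked = sorted(columns, key=lambda name: chi2_score(columns[name], labels), reverse=True)
    return ranked[:k]

# Hypothetical toy data; the feature names are illustrative only.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "has_ip_address": [1, 1, 1, 1, 0, 0, 0, 0],  # perfectly aligned with the class
    "long_url": [1, 1, 1, 0, 0, 0, 0, 1],        # partly informative
    "uses_https": [1, 0, 1, 0, 1, 0, 1, 0],      # independent of the class
}
top_two = select_k_best(columns, labels, k=2)
```

Features whose values are distributed independently of the class receive a score near zero and are dropped first, which is how a method of this kind can reduce the feature set to a fixed size such as 20.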

Figure 7 summarizes the data presented in Table 7. In
this research, the objective was to identify the most
effective URL-based features for phishing detection by
employing six different feature selection algorithms:
Pearson, Chi-square (Chi-2), Logistic Regression
(Logistics), Random Forest, Light Gradient Boosting
Machine (Light GBM), and Recursive Feature
Elimination (RFE). Each of these algorithms selected a
set number of features from an initial pool,
demonstrating an enhancement in detection accuracy
when used with various classifiers such as SVM (both
linear and radial basis function), Logistic Regression
(LR), Random Forest (RF), AdaBoost, Decision Tree
(DT), K-Nearest Neighbors (K-NN), and Gradient
Boosting Classifier (GBC). The feature selection
processes resulted in a range of features being chosen by
each algorithm, with Pearson, Chi-2, and RFE selecting
20 features each, and Logistic Regression, Random
Forest, and Light GBM selecting fewer: 12, 13, and 11
features respectively. The selected features were then
used to train and test the classifiers, and the accuracies
were recorded. With all features, RF stood out, achieving
the highest accuracy of 96.74%, indicating superior
performance over other classifiers. The validation metrics
including precision, recall, F1-score, and a confusion
matrix further reinforced RF's efficacy. The research
highlighted the impact of feature selection on the
efficiency of phishing detection models. For instance,
using the Chi-2 method, the RF classifier achieved a high
accuracy of 96.99% with just 20 features, compared to
96.74% accuracy with all 31 features, showing that
feature reduction can still preserve or even enhance
model performance. This strategy not only simplifies the
model but also optimizes computational efficiency
without compromising detection capability. The findings
suggest that a carefully selected subset of features can
effectively support robust phishing detection,
underscoring the importance of feature selection in
building efficient security models in the cyber domain.
Table 8 presents the classifier accuracy (Test) for various
feature selections. Using Pearson with 20 features, the
accuracy is 96.07%. For Chi-2 with 20 features, it is
96.99%. RFE with 20 features yields an accuracy of
96.25%. Logistic regression with 12 features results in an
accuracy of 95.02%. Random Forest with 13 features
achieves an accuracy of 96.02%, while LightGBM with 11
features gives 94.57%. WFS with all features provides an
accuracy of 96.74%.


Int. J. Exp. Res. Rev., Special Vol. 40: 73-89 (2024)


Figure 8. A test of the accuracy of the classifiers with all features and different numbers of features


Table 8. Classifier accuracy (Test) for various feature selections

Classifiers | Pearson (Features = 20) | Chi-2 (Features = 20) | RFE (Features = 20) | Logistics (Features = 12) | Random Forest (Features = 13) | LightGBM (Features = 11) | WFS* (Features = All)
RF | 96.07 | 96.99 | 96.25 | 95.02 | 96.02 | 94.57 | 96.74
*WFS: without feature selection
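
The comparison in Table 8 can be restated as a short calculation of each method's change in RF test accuracy relative to the run without feature selection. The figures below are copied from Table 8; the variable names are illustrative, and the sketch adds no new results.

```python
# RF test accuracy (%) per feature selection method, copied from Table 8.
rf_test_accuracy = {
    "Pearson (20)": 96.07,
    "Chi-2 (20)": 96.99,
    "RFE (20)": 96.25,
    "Logistics (12)": 95.02,
    "Random Forest (13)": 96.02,
    "LightGBM (11)": 94.57,
}
baseline = 96.74  # WFS: all features, without feature selection

# Method with the highest test accuracy and the change against the baseline.
best_method = max(rf_test_accuracy, key=rf_test_accuracy.get)
gains = {method: round(acc - baseline, 2) for method, acc in rf_test_accuracy.items()}

for method, gain in gains.items():
    print(f"{method}: {gain:+.2f} percentage points vs. all features")
```

Read this way, Chi-2 is the only selection method whose 20-feature subset exceeds the all-features baseline for RF (by 0.25 percentage points), while the remaining methods trade a drop of between 0.49 and 2.17 percentage points for a smaller feature set.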


Discussion 

In this study, the primary goal was to enhance the
detection of phishing websites by selecting the most
effective URL-based features using a variety of feature
selection algorithms. These algorithms, detailed in Table
1, include Pearson correlation, Chi-square (Chi-2),
Logistic regression, Random Forest (RF), Light Gradient
Boosting Machine (Light GBM), and Recursive Feature
Elimination (RFE). These methods automate the feature
selection process, and the best of them improved
detection accuracy. Table 3 highlights the number of
features selected by each method. The efficiency of
feature selection was demonstrated through the
application of several classifiers, including Support
Vector Machine (SVM) with both linear and radial basis
function (rbf) kernels, Logistic Regression (LR), RF,
AdaBoost, Decision Tree (DT), K-Nearest Neighbors
(K-NN), and Gradient Boosting Classifier (GBC). Each
classifier was evaluated on the entire phishing dataset,
with performance metrics such as precision, recall,
F1-score, and confusion matrix presented in Table 6.
Among these, RF achieved the highest accuracy with all
features, 96.74%, which is the baseline against which the
feature selection algorithms were assessed. Notably,
Pearson, Chi-2, and RFE each selected 20 out of a
possible 30 features (Table 7), resulting in high testing
accuracies of 96.07%, 96.99%, and 96.25%, respectively.


rates even with limited features. Our method is capable of 
identifying phishing sites in real time, offering better 
performance compared to existing solutions. Future work 
will enhance our model by incorporating webpage 
content analysis once a webpage is fully loaded on a 
user's device, providing a more robust defense by 
combining URL-based and content-based detection 
techniques. 
Table 9. Comparing our approach to that of recent studies, where NR = Not Reported.

Author | Method | No. of Features | Accuracy | Precision | Recall | F1-score
Chiew et al. (2019) | Hybrid Ensemble Feature Selection (HEFS), Cumulative Distribution Function gradient (CDF-g) Algorithm | 48 with Dataset 1 | 96.17% | NR | NR | NR
Chiew et al. (2019) | Hybrid Ensemble Feature Selection (HEFS), Cumulative Distribution Function gradient (CDF-g) Algorithm | 10 with Dataset 1 | 94.60% | NR | NR | NR
Chiew et al. (2019) | Hybrid Ensemble Feature Selection (HEFS), Cumulative Distribution Function gradient (CDF-g) Algorithm | 30 with Dataset 2 | 94.27% | NR | NR | NR
Chiew et al. (2019) | Hybrid Ensemble Feature Selection (HEFS), Cumulative Distribution Function gradient (CDF-g) Algorithm | 5 with Dataset 2 | 93.22% | NR | NR | NR
Zhu et al. (2019) | OFS-NN neural network | 30 | 96.44% | 94.78% | 99.02% | 96.85%
Ours | RF | 30 | 96.74% | 96.31% | 98.00% | 97.15%
Ours | RF with Chi-2 Feature Selection | 20 | 96.99% | NR | NR | NR
Conclusion and Future Work References 


Website phishing is a serious cyber threat that targets 
unsuspecting internet users, aiming to capture sensitive 
personal information such as usernames, passwords, and 
financial details. In our research, we explore effective 
methods for identifying fake websites, focusing on 
critical features that distinguish these sites. We introduce 
six strategies for selecting the most informative features 
to aid in the detection of phishing attempts. Additionally, 
we developed a strategy for detecting phishing websites 
using eight different machine-learning algorithms. 
Among these, the Random Forest (RF) classifier was 